| Objective | Complete |
|---|---|
| Transform data using tidyverse to prepare for compound visualizations | |
| Visualize the transformed data in a boxplot and a scatterplot with ggplot2 | |
As noted earlier, each chart built with the ggplot2 package is like a layered cake, with elements built over a base
ggplot2 is part of the tidyverse collection of R packages
ggplot2 works best with data in long (tidy) format
We have already created a simple histogram and a scatterplot for a single variable from our dataset
The next step is to compare the distributions of several variables
To do so, we will create a series of normalized boxplots, allowing us to compare each variable’s relative variation
We will then create a complex scatterplot of the normalized values
Load the tidyverse package that includes ggplot2
library(tidyverse)
ggplot2 theme
# Save our custom `ggplot` theme to a variable.
my_ggtheme = theme_bw() +
  theme(axis.title = element_text(size = 20),
        axis.text = element_text(size = 16),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 18),
        plot.title = element_text(size = 25),
        plot.subtitle = element_text(size = 18))
gather()
Convert the data to long format with gather() for easy visualization
ggplot2 requires the data to be in this format
Newer tidyr versions also provide pivot_longer(), which does the same work as gather(); gather() still works but is superseded
  variable value
1 age 67
2 age 61
3 age 80
4 age 49
5 age 79
6 age 81
variable value
15325 bmi 18.60000
15326 bmi 28.89324
15327 bmi 40.00000
15328 bmi 30.60000
15329 bmi 25.60000
15330 bmi 26.20000
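The call that produced the long table above is not shown in this section. A minimal sketch of the reshaping step, assuming a small hypothetical data frame `toy_health` in place of the real `health_subset` (its values are made up for illustration), might look like this. It runs both `gather()` and the newer `pivot_longer()` so the two results can be compared.

```r
library(tidyr)

# Hypothetical stand-in for `health_subset`; the values are illustrative only.
toy_health = data.frame(age = c(67, 61, 80),
                        avg_glucose_level = c(228.7, 202.2, 105.9),
                        bmi = c(36.6, 28.9, 32.5))

# gather(): with no columns named, every column is stacked into
# `variable` / `value` pairs, one block of rows per original column.
toy_long = gather(toy_health, key = "variable", value = "value")

# pivot_longer(): the newer equivalent. It orders rows by original row
# rather than by variable, so the row order differs from gather().
toy_long2 = pivot_longer(toy_health,
                         cols = everything(),
                         names_to = "variable",
                         values_to = "value")

head(toy_long)
```

Both results hold the same `variable` / `value` pairs; only the row order differs, which does not matter for plotting with ggplot2.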
| Objective | Complete |
|---|---|
| Transform data using tidyverse to prepare for compound visualizations | ✔ |
| Visualize the transformed data in a boxplot and a scatterplot with ggplot2 | |
geom_boxplot()
# Let's normalize the data and then create boxplots.
health_subset_long = health_subset_long %>%
  group_by(variable) %>%              #<- group values by variable
  mutate(norm_value =                 #<- make `norm_value` column
           value / max(value,         #<- divide value by group max
                       na.rm = TRUE)) #<- don't forget the NAs!
head(health_subset_long)
# A tibble: 6 x 3
# Groups: variable [1]
variable value norm_value
<chr> <dbl> <dbl>
1 age 67 0.817
2 age 61 0.744
3 age 80 0.976
4 age 49 0.598
5 age 79 0.963
6 age 81 0.988
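The code that first builds the saved boxplot object is not shown in this section. As a rough sketch, assuming hypothetical toy data in place of `health_subset_long` and `theme_bw()` in place of the custom `my_ggtheme`, the normalization and a basic boxplot could be put together like this. The name `toy_boxplots` is illustrative and is not the course's `boxplots_norm`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format data standing in for `health_subset_long`.
toy_long = data.frame(
  variable = rep(c("age", "bmi"), each = 5),
  value    = c(67, 61, 80, 49, 79, 18.6, 28.9, 40.0, 30.6, NA))

# Scale each variable by its own maximum, ignoring missing values.
toy_long = toy_long %>%
  group_by(variable) %>%
  mutate(norm_value = value / max(value, na.rm = TRUE)) %>%
  ungroup()

# One box per variable, filled by variable. The fill legend repeats the
# x-axis labels, so it is switched off.
toy_boxplots = ggplot(toy_long,
                      aes(x = variable, y = norm_value, fill = variable)) +
  geom_boxplot(na.rm = TRUE) +  # drop the missing value without a warning
  theme_bw() +                  # a custom theme such as `my_ggtheme` would go here
  guides(fill = "none")

toy_boxplots
```

Dividing by the group maximum puts every variable on a 0 to 1 scale, which is what makes boxes for variables with very different units comparable side by side.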
Create the plot and add my_ggtheme to it
To remove the legend for the fill color, use the function guides() and pass fill = "none" to it (older ggplot2 versions also accept fill = FALSE)
# Make outliers stand out with red color and bigger size.
boxplots_norm = boxplots_norm +              #<- previously saved plot
  geom_boxplot(outlier.color = "red",        #<- adjust outlier color
               outlier.size = 5) +           #<- adjust outlier size
  labs(title = "Health data variables",      #<- add title and subtitle
       subtitle = "Boxplot of scaled data")
Separate age from all other variables
Each entry in age corresponds to an entry in all other variables
Gather every variable except the age variable
health_subset_long2 = health_subset %>%
  gather(avg_glucose_level:bmi,              #<- gather all variables but `age`
         key = "variable",                   #<- set key to `variable`
         value = "value") %>%                #<- set value to `value`
  group_by(variable) %>%
  mutate(norm_value = value / max(value, na.rm = TRUE))
# Inspect the data.
head(health_subset_long2)
# A tibble: 6 x 4
# Groups: variable [1]
age variable value norm_value
<dbl> <chr> <dbl> <dbl>
1 67 avg_glucose_level 229. 0.842
2 61 avg_glucose_level 202. 0.744
3 80 avg_glucose_level 106. 0.390
4 49 avg_glucose_level 171. 0.630
5 79 avg_glucose_level 174. 0.641
6 81 avg_glucose_level 186. 0.685
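With `age` kept as its own column, each remaining variable can be plotted against it. The sketch below runs the whole sequence on simulated data: points, 2D density contours, and a `facet_wrap()` call, which this section describes but does not show. The data frame `toy_long2`, its values, and the `ncol = 2` layout are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical data shaped like `health_subset_long2`: one `age` column plus
# long-format `variable` / `norm_value` columns. All values are simulated.
set.seed(1)
n = 50
toy_long2 = data.frame(
  age        = rep(round(runif(n, 20, 80)), times = 2),
  variable   = rep(c("avg_glucose_level", "bmi"), each = n),
  norm_value = runif(2 * n, 0.2, 1))

toy_scatter = ggplot(toy_long2,
                     aes(x = norm_value, y = age, color = variable)) +
  geom_point(size = 3, alpha = 0.7) +   # semi-transparent points
  geom_density2d(alpha = 0.7) +         # 2D density contours
  facet_wrap(~ variable, ncol = 2) +    # one panel per variable
  guides(color = "none") +              # panel strips already name the variable
  theme_bw()

toy_scatter
```

Faceting by `variable` gives each variable its own panel, so overlapping point clouds no longer hide one another.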
# Create a base plot.
base_norm_plot = ggplot(data = health_subset_long2,  #<- set data
                        aes(x = norm_value,          #<- set x-axis to represent normalized value
                            y = age,                 #<- y-axis to represent `age`
                            color = variable)) +     #<- set color to depend on `variable`
  my_ggtheme                                         #<- set theme
base_norm_plot
# Create a scatterplot.
scatter_norm = base_norm_plot +  #<- base plot
  geom_point(size = 3,           #<- add point geom with size of point = 3
             alpha = 0.7)        #<- make it 70% opaque
# View updated plot.
scatter_norm
# Adjust scatterplot to include 2D density.
scatter_norm = scatter_norm +    #<- previously saved plot
  geom_density2d(alpha = 0.7)    #<- add 2D density geom with 70% opaque color
# View updated plot.
scatter_norm
Use the facet_wrap() function, which splits the data by one or more variables and plots the subsets together
# Add finishing touches to the plot.
scatter_norm = scatter_norm +                          #<- previously saved plot
  guides(color = "none") +                             #<- remove legend for color mappings
  theme(strip.text.x = element_text(size = 14)) +      #<- increase text size in strips of facets
  labs(title = "Health data: Age vs. other variables", #<- add title and subtitle
       subtitle = "2D distribution of scaled data")
Plots can be exported to PNG, JPEG, BMP, and PDF
We will export to PNG and PDF formats
The png() function opens the R graphics device and lets us save our plots in PNG file format
png("Name_of_file.png", #<- name of file
    width = 400,        #<- width of image
    height = 300,       #<- height of image
    units = "px")       #<- units for height & width
print(plot1)            #<- print the plot object you want to export
dev.off()               #<- closes R graphics device
The dev.off() command closes the R graphics device, which writes the file and lets us continue working with our plots
bmp(), jpeg(), and other graphic export commands use a similar command format
There are a few advantages of saving plots to a PDF format as opposed to an image:
The pdf() function follows a similar syntax to png(), except that width and height are given in inches
You are now ready to try tasks 8-16 in the Exercise for this topic
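The export steps described above can be sketched end to end as follows. The plot object `plot1`, the built-in `mtcars` data, the temporary file paths, and the PDF dimensions are all assumptions chosen so that the example runs on its own. Any saved ggplot object, such as `scatter_norm`, could take the place of `plot1`.

```r
library(ggplot2)

# Hypothetical plot object, built from the `mtcars` data that ships with R.
plot1 = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Temporary file paths keep the sketch from cluttering the working directory.
png_path = tempfile(fileext = ".png")
pdf_path = tempfile(fileext = ".pdf")

png(png_path, width = 400, height = 300, units = "px")  # size in pixels
print(plot1)   # draw the plot into the open device
dev.off()      # close the device so the file is written

pdf(pdf_path, width = 6, height = 4.5)                   # size in inches
print(plot1)
dev.off()

file.exists(c(png_path, pdf_path))
```

Calling `print()` explicitly is the safer habit: a bare object name only draws the plot at the top level of a script or console, not inside a function or loop.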
| Objective | Complete |
|---|---|
| Transform data using tidyverse to prepare for compound visualizations | ✔ |
| Visualize the transformed data in a boxplot and a scatterplot with ggplot2 | ✔ |
In this part of the course, we have covered the following concepts: